Abstract

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we 
introduce here a measure of how word ranks change in time and call this distribution rank diversity. 
We calculate this diversity for books published in six European languages since 1800, and find that it 
follows a universal lognormal distribution. Based on the mean and standard deviation associated with the 
lognormal distribution, we define three different word regimes of languages: “heads” consist of words which
almost do not change their rank in time, “bodies” are words of general use, while “tails” comprise
context-specific words that vary their rank considerably in time. The heads and bodies reflect the
size of language cores identified by linguists for basic communication. We propose a Gaussian random 
walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity 
of words can be understood as the result of random variations in rank, where the size of the variation 
depends on the rank itself. We find that the core size is similar for all languages studied. 


Introduction 


Statistical studies of languages have become popular since the work of George Zipf [1] and have been
refined with the availability of large data sets and the introduction of novel analytical models [2-7]. Zipf
found that when words of large corpora are ranked according to their frequency, there seems to be a
universal tendency across texts and languages. He proposed that ranked words follow a power law $f \sim 1/k$,
where $k$ is the rank of the word (the higher ranks corresponding to the least frequent words) and $f$ is
the relative frequency of each word [9]. This regularity of languages and other social and physical
phenomena had been noticed beforehand, at least by Jean-Baptiste Estoup [10] and Felix Auerbach [11],
but it is now known as Zipf's law.

Zipf's law is a rough approximation of the precise statistics of rank-frequency distributions of languages.
As a consequence, several variations have been proposed [12-15]. We compared Zipf's law with four other
models, all of them behaving as $1/k^a$ for small $k$, with $a \approx 1$, as detailed in the SI. We found that all
models have systematic errors, so it was difficult to choose one over the other.

Studies based on rank-frequency distributions of languages have proposed two word regimes [15, 16]: a
“core” where the most common words occur, which behaves as $1/k^a$ for small $k$, and another region for
large $k$, which is identified by a change of exponent $a$ in the distribution fit. Unfortunately, the point
where exponent $a$ changes varies widely across texts and languages, from 5000 [16] to 62,000 [15]. A recent
study [17] measures the number of most frequent words which account for 75% of the Google Books corpus.
Differences of an order of magnitude across languages were obtained, from 2365 to 21077 words (including
inflections of the same stems). This illustrates the variability of rank-frequency distributions. The core of
human languages can be considered to be between 1500 and 3000 words (not counting different inflections
of the same stems), based on basic vocabularies for foreigners [18], creole [19], and pidgin languages [20].
For example, Voice of America's Special English [21] and Wikipedia in Simple English use about 1500 and
2000 words, respectively (not counting inflections). The Oxford Advanced Learner's Dictionary lists 3000
priority lexical entries [22]. This suggests that the change of exponent $a$ or another arbitrary cutoff in
rank-frequency distributions does not reflect the size of the core of languages.

In view of these problems with rank-frequency distributions, we propose a novel measure to characterize 
statistical properties of languages. We have called this measure rank diversity and it tells us how words 
change their rank in time. With rank diversity, three regimes of words are identified: “heads”, “bodies” 
and “tails”. This measure of rank diversity follows the same simple functional law with similar parameters 
for all data analyzed. In particular, this is so for the six European languages studied here using a large 
data set of more than $6.4 \times 10^{11}$ words from Google Books [23], which contains about 4% of all books
written until 2008. It should be noted that this data set includes all different inflected forms (such as
plural, different tense/aspect forms, etc.) found in the book corpus. Data sets such as this have allowed
the study of “culturomics”: how cultural traits such as language have changed in time [24-30].

The rank diversity follows a scale-invariant behavior regarding its fluctuations, which inspires a model 
based on random walks, with scale-invariant random steps. This model reproduces the behavior of diversity 
and thus captures the essence of the evolution of word rank across different languages. 


Rank diversity of words 

In what follows we shall consider six European languages from the Indo-European family. They are 
English and German; Spanish, French and Italian; and Russian. They belong to different linguistic 
branches: Germanic, Romance, and Slavic, respectively. The native speakers of these languages account 
for approximately 17% of the world population. 

We start with the 20 most used words in each of the six languages, that is, the
lowest-ranked words. Using, for the sake of uniformity, the first sense or first meaning given by Google
Translate, once these words are translated into English, the coincidences across all six languages are remarkable
(see Table S1 in File S1). This could have been foreseen, since most of the lowest-ranked words are articles,
prepositions or conjunctions, i.e. what are called function words. A different matter, as we shall see, would
result if we had considered only nouns, verbs, adverbs or adjectives, known as content words.

In order to quantify this fact, we present in Fig. 1 the time evolution of the overlap of the 20
lowest-ranked words in the other five languages with respect to the corresponding list for English. From the
upper part of this figure we can see that along two centuries this overlap fluctuates around 0.9, a rather 
large number, except for Russian, since this language does not have articles. These data reveal that these 
Indo-European languages have shared structural properties, notwithstanding that they belong to distinct 
linguistic branches. 
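The overlap used here can be sketched in a few lines. The text does not give a formula, so this is only one plausible reading: the fraction of the n lowest-ranked English words that also appear among the n lowest-ranked (translated) words of another language. The function name `top_n_overlap` and the toy word lists are illustrative assumptions, not data from the corpus.

```python
def top_n_overlap(reference, other, n=20):
    """Fraction of the n lowest-ranked reference words that also appear
    among the n lowest-ranked words of the other (translated) list.

    Assumed definition: the paper does not spell out the formula.
    """
    ref_top = set(reference[:n])
    other_top = set(other[:n])
    return len(ref_top & other_top) / n


# Toy lists, already translated into English (hypothetical, not corpus data).
english = ["the", "of", "and", "to", "in"]
other_language = ["of", "the", "and", "to", "he"]
overlap = top_n_overlap(english, other_language, n=5)
print(overlap)  # 0.8
```

Note that if two words of the other language translate to the same English word, the set shrinks, which lowers the overlap under this definition.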

Figure 1. Overlap of the 20 most frequent words (continuous lines), and of the 20 most 
frequent content words (dashed lines) across languages, with respect to English, as a function of 
time. When words have more than one meaning, the first sense, according to Google Translate, was used. 
The color code for languages is as follows: light blue for French, green for German, yellow for Italian, dark 
blue for Spanish, and dark orange for Russian. Additionally, light orange will be used for English when
required (see also Fig. 2). The same color coding for languages will be used throughout the rest of the
article.

The lowest-ranked words used to construct the upper part of Fig. 1 are essentially the same across
centuries (see Figs. S3-S8 in the SI). But this is not the case for content words, as can be seen in Table S2 in
File S1. First, as also shown by the dashed curves in Fig. 1, the overlap of these words with respect to
English for the other five languages (including Russian) is of the order of 0.5. These values are much lower
than the overlap of function words. Second, the most common nouns vary considerably with time. On the
one hand, nouns like time, man, life and their translations in the other languages are present independently
of the century. On the other hand, words like god and king have a low rank in the eighteenth century but
have a larger rank in the last century. The rank change in time of these nouns reflects cultural facts.

What is discussed in the previous paragraph is an example of what could be called rank diversity d(k).
This is, in the present study, the number of different words occurring at a specific rank k over a given
period of time $\Delta t$ (normalized by $\Delta t$, so that d(k) = 1 when a different word occupies rank k every year).
We found that the resulting rank diversity curves for the six languages studied between
1800 and 2008 are similar to each other, as shown in Figs. 2 and 6. Low ranks have a very low diversity,
as few words appear in the same ranks for the years we have studied.
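As a minimal sketch of this definition, the following computes d(k) from a list of yearly rankings. The input layout (one ranked word list per year) and the name `rank_diversity` are assumptions made for illustration. Dividing by the number of years follows the example d(1) = 1/208 given in the caption of Fig. 2.

```python
def rank_diversity(rankings_by_year):
    """Rank diversity d(k) for k = 1..K.

    rankings_by_year[t][k - 1] is the word holding rank k in year t.
    d(k) = (number of distinct words seen at rank k) / (number of years).
    """
    n_years = len(rankings_by_year)
    n_ranks = min(len(ranking) for ranking in rankings_by_year)
    diversity = []
    for k in range(n_ranks):
        words_at_rank = {ranking[k] for ranking in rankings_by_year}
        diversity.append(len(words_at_rank) / n_years)
    return diversity


# Toy data: four "years", four ranks (hypothetical, not corpus data).
toy = [
    ["the", "of", "and", "king"],
    ["the", "of", "and", "time"],
    ["the", "and", "of", "man"],
    ["the", "of", "and", "life"],
]
d = rank_diversity(toy)
print(d)  # [0.25, 0.5, 0.5, 1.0]
```

In the toy data the first rank is always held by the same word, so d(1) takes its minimum value 1/4, while the last rank holds a different word every year, so d(4) = 1.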












Figure 2. Rank diversity. Diversity d as a function of the rank k for different languages from 1800 to
2008, where d(k) measures how many different words appear for a given rank k during the time
considered ($\Delta t = 208$). For example, for English, d(1) = 1/208, as the word ‘the’ appears in the first rank
for all years considered. Although we have analyzed up to $k \sim 10^6$, rank diversity for $k > 10^4$ is not
shown as $d(k) \sim 1$, i.e., a different word appears in each rank every year. Data are windowed over time,
with a slot of size 0.1 in $\log_{10} k$, for the sake of clarity. Additionally, the sigmoid defined in equation (1)
is shown as a black dashed curve, with the best fit parameters, also reported in each subfigure. The mean
square error e between the data and the fit is also given. The shaded region corresponds to the average
“body” of all languages.

As shown by the continuous lines in Fig. 2, the sigmoid curve fits d(k) very well for all languages
considered, except for low k where the statistical fluctuations are larger due to the small sample size.
The sigmoid is the cumulative of a Gaussian distribution, i.e.

\[
\Phi_{\mu,\sigma}(\log_{10} k) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\log_{10} k} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy, \tag{1}
\]

and is given as a function of $\log_{10} k$. The values of $\mu$ and $\sigma$ reported in Fig. 2 were obtained by fitting
equation (1) to the rank diversity calculated for each individual language. The mean value $\mu$ identifies the
point where $d(k) \approx 0.5$, while the width $\sigma$ gives the scale in which d(k) gets close to its extremal values.
When $\log_{10} k$ is much larger than $\mu + \sigma$, $\Phi_{\mu,\sigma}(\log_{10} k)$ gets exponentially close to one, whereas when $\log_{10} k$ is
much smaller than $\mu - \sigma$ it gets exponentially close to zero. It is customary in statistics to define the bulk
of the Gaussian as the interval $\mu \pm 2\sigma$, where 95% of the population lies. Along the same lines, we define three
regions, marked by

\[
\log_{10} k_{\pm} = \mu \pm 2\sigma. \tag{2}
\]

First, we find what we shall call the head of the language, distributed with ranks between 1 and $k_-$; a
second region, identified as the body of the language, lies between $k_-$ and $k_+$; and finally the tail, beyond
$k_+$. From the values reported in Fig. 2 we see that $9 < k_- < 22$, while $k_+$ lies between 1832 and 3099.
As shown in Fig. 3, these regions are robust to changes in the historical period considered and to the data
set size (larger for recent years).
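To make equations (1) and (2) concrete, the sketch below evaluates the sigmoid with the error function and converts a pair of parameters into the borders of the three regimes. The parameter values are illustrative only: they are not the fitted values of Fig. 2, and were chosen so that the resulting borders fall inside the ranges quoted above. The function names are hypothetical.

```python
import math


def sigmoid(log10_k, mu, sigma):
    """Cumulative Gaussian of equation (1), evaluated at log10(k)."""
    return 0.5 * (1.0 + math.erf((log10_k - mu) / (sigma * math.sqrt(2.0))))


def regime_borders(mu, sigma):
    """Borders k_- and k_+ from equation (2): log10(k_pm) = mu +/- 2 sigma."""
    return 10 ** (mu - 2 * sigma), 10 ** (mu + 2 * sigma)


mu, sigma = 2.3, 0.55  # illustrative values, NOT taken from the fits in Fig. 2
k_minus, k_plus = regime_borders(mu, sigma)
print(round(k_minus, 1), round(k_plus, 1))

# The sigmoid is 1/2 at the center and about 0.977 at the body/tail border.
print(sigmoid(mu, mu, sigma), sigmoid(math.log10(k_plus), mu, sigma))
```

With these values the head ends near rank 16 and the body near rank 2500, so head plus body is of the same order as the language cores discussed in the introduction.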
Figure 3. Evolution in time of the center of the sigmoid (middle panel), and the borders
of the head and body (bottom panel) and body and tail (top panel) for the different
languages along time for intervals of fifty years, i.e. $\Delta t = 50$. Head words have $k < k_-$, body
words have $k_- < k < k_+$, and tail words have $k_+ < k$. See Fig. 2 for color coding.


The bodies of languages consist of words that have limited change in time. Based on the size of basic 
vocabularies, it can be argued that the “core” of English is between 1500 and 3000 words, as mentioned in 
the introduction, which is consistent with our results. If we agree that the rank diversity identifies the
core (head and body) of English, then it can be argued that the size of the core of the other five languages
studied is similar [31], which is also supported by the high similarity across languages in Fig. 2.

The tails of languages are formed by words which vary their rank considerably in time. This implies 
that they are more dependent on the text and its domain than words from the core. It can be assumed 
that words belonging to the head and body of languages have a high probability of being used in any text, 
while words from the tail would appear only in specific texts and domains. 

Note that we obtain language cores slightly larger than those proposed by linguists. This is to be 
expected, as the Google Books data set treats word forms inflected for different persons, tenses, genders,
numbers, cases, and so forth, as distinct items, while dictionaries count only stems (presented as citation 
forms, i.e. the basic form that users are most likely to look up). For example, the core for English obtained 
using rank diversity consists of 2448 words, but within these there are only 1760 different stems in the 
year 2008. Moreover, the studied data set contains several proper names which are not included in basic 
vocabulary lists. For English, 55 out of 2448 are proper names in 2008. 

The rank evolution in time of particular words belonging to the head, body, and tail of English is
shown in Fig. 4a. This confirms the results shown in Fig. 2, where low-ranked words exhibit little variation
in time and this variation increases with the rank. More trajectories are presented in the SI. As mentioned
above, words from the head vary little over time. However, the way in which words from the body or tail 
vary their rank in time appears to be similar, although at a different scale. This similarity leads us to 
propose a model of rank diversity where the amount of rank variation depends only on the rank. 
Figure 4. Rank evolution. (a): Evolution of the rank for several randomly chosen words in
different regimes in the English language. From bottom to top we show words with initial ranks of order 1
(head), 100 (body) and 1000 (tail). (b): Evolution of the rank for several randomly chosen words in
different regimes, for our scale-free Gaussian walker, i.e. the simulated language we have generated.


A random walk model for rank diversity 

We consider the relative size of frequency changes, or flights as they are sometimes called in statistical
physics, defined as $(k_{t+1} - k_t)/k_t$, where $k_t$ is the rank at discrete time t of a given element. We present
in Fig. 5 the distribution of these frequency changes for English, our largest data set, and in Fig. 17 in
File S1 for all languages. Notice that, on average, the relative jumps seem to be largely independent of
the value of the rank. We propose, based on this fact, a simple model to understand the evolution of rank
diversity of words.
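The relative flights defined above are simple to compute for a single word once its rank is known for consecutive years. The sketch below assumes the ranks are given as a plain list ordered in time; the function name and the toy series are illustrative.

```python
def relative_flights(rank_series):
    """Relative rank changes (k_{t+1} - k_t) / k_t for one word.

    rank_series holds the rank of the word at consecutive time steps.
    """
    return [
        (k_next - k) / k
        for k, k_next in zip(rank_series, rank_series[1:])
    ]


# Toy rank trajectory of one word over four years (hypothetical).
ranks = [100, 110, 99, 99]
flights = relative_flights(ranks)
print(flights)  # approximately [0.1, -0.1, 0.0]
```

A series of T ranks yields T - 1 flights. Collecting the flights of many words that start in the same range of ranks gives histograms of the kind described for Fig. 5.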
Figure 5. Distribution of relative size of frequency changes $(k_{t+1} - k_t)/k_t$ in the case of English
for words in the head (that start with rank between 1 and 10), the body (rank between 200 and 210),
and the tail (rank between 5000 and 5010). Notice that for words in the head, the granularity of the
model (equation (3)) shows up as large deviations from the Gaussian. For the body and tail, the relative
jumps are similar independently of the initial rank of the word. We also show, as a thick green curve, the
Lorentzian distribution which best fits the average of the curves for the body and tail. A Gaussian, with
zero mean and the most common standard deviation $\sigma = 0.0575$, is also shown in red for comparison (see
text for details). The corresponding plot for other languages is shown in the supplementary information.

We shall call this model a scale-invariant random Gaussian walk, since a word with rank $k_t$ is converted
to rank $k_{t+1}$ according to the following procedure. One defines an auxiliary variable at time t + 1 by
the relation

\[
s_{t+1} = k_t + G(0, k_t \sigma), \tag{3}
\]

where $G(0, \sigma)$ is a Gaussian random number generator of width $\sigma$ and mean 0. This means that the
random variable $s_{t+1}$ has a distribution whose width is proportional to $k_t$. Words with very low ranks will change
very slowly or not at all, while those with higher k have a larger rank variation in time, as reflected by
d(k). Once the values of $s_{t+1}$ for all words are obtained, they are ordered according to their magnitude.
This new order gives new rankings, i.e. the k values at time t + 1. There is a small correlation of the
jumps between different times in this model. This is consistent with the observed behavior of the six
languages dealt with here, as can be seen in Fig. 18 in File S1. The only parameter in the model is the
width $\sigma$, which is the most common standard deviation of the relative frequency changes of each data set.
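The procedure just described (draw the auxiliary variable of equation (3) for every word, then re-rank by sorting) can be sketched as follows. This is a minimal illustration rather than the authors' code: the number of words, the number of steps, the seed, and all function names are assumptions, and words are represented by integer labels. Only the width 0.0575 is taken from the text (the value quoted for English in Fig. 5).

```python
import random


def step(ranking, sigma, rng):
    """One time step of the scale-invariant Gaussian walk.

    ranking[i] is the word currently holding rank i + 1.
    """
    scored = []
    for i, word in enumerate(ranking):
        k = i + 1  # ranks start at one
        s = k + rng.gauss(0.0, k * sigma)  # equation (3)
        scored.append((s, word))
    scored.sort(key=lambda pair: pair[0])  # new order gives the new ranks
    return [word for _, word in scored]


def simulate(n_words, n_steps, sigma, seed=0):
    """Return the list of rankings, one per time step (initial one included)."""
    rng = random.Random(seed)
    ranking = list(range(n_words))  # integer labels stand in for words
    history = [ranking]
    for _ in range(n_steps):
        ranking = step(ranking, sigma, rng)
        history.append(ranking)
    return history


def diversity_at(history, k):
    """Rank diversity d(k): distinct words seen at rank k, per time step."""
    return len({ranking[k - 1] for ranking in history}) / len(history)


# Assumed sizes for illustration; sigma is the width quoted for English.
history = simulate(n_words=2000, n_steps=50, sigma=0.0575)
print(diversity_at(history, 1), diversity_at(history, 1000))
```

In such a run the first rank is expected to keep the same word throughout, because a swap with rank 2 would need a fluctuation of many standard deviations, while a rank near 1000 changes hands at almost every step.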

A word of caution is in order. In Fig. 5 two curves are plotted: in green, a Lorentzian distribution,
and in red, a Gaussian distribution, both centered at zero, and with a width obtained by best fit to the
data presented here. Although the Lorentzian fits these data somewhat better than the Gaussian, we 
use the latter in our model, since the long tails of the Lorentzian would yield long flights in words (not 
observed in the historical data) and a very different function d(k). One should recall that the Lorentzian 
does not have a finite second moment, so this might be the reason for this distribution to be inadequate. 
It is probable that a truncated Lorentzian could be a better choice, but we leave this detail open as a 
possible refinement to our model. 

With this model we have produced the evolution of a random simulated language; see [32] for other
approaches. Fig. 4b shows examples of rank trajectories at different scales, exhibiting similarities with
those of actual words shown in Fig. 4a. Moreover, if its diversity d(k) is calculated with the $\sigma$ corresponding
to the most common width of the distribution of relative size of flights for all words in the English language
from 1800 to 2008, the results coincide with the sigmoid obtained for all six languages analyzed, as shown
in Fig. 6.




Figure 6. Rank diversity for the simulated language. The green curve represents the diversity 
corresponding to the language dynamics of a single realization of the Gaussian random walk model. We 
also include data for all languages studied, but normalized so that k± coincide. The ansatz for the rank 
diversity is plotted as a parameter-free cumulative of a Gaussian with zero mean and unit variance as a 
dashed black curve. 


Discussion 


Within statistical linguistics, the frequency-rank distributions of several languages of European origin have 
been analyzed for many years now. However, no simple model can reproduce the detailed properties of 
this distribution (see SI). In particular, there has been the proposal that there exist two different regimes 
for ranks, but these regimes have not been satisfactorily validated in the empirical data. Due to these 
difficulties we have been led to introduce a statistical measure, which we have called rank diversity, to 
describe the statistical properties of natural languages. A simulated random language was generated 
which reproduces the observed features quite well. 

Our random walk model mimics the evolution of languages to produce a simulated rank diversity
which closely matches that of historical data. We consider the statistical similarities across languages, and
the simplicity of the model that reproduces them, sufficient evidence to claim that rank diversity of words is
universal. This does not imply that all languages have the same rank diversity curves, but that the rank
diversity distribution of all the languages studied here can be fitted properly with equation (1). Certainly,
different languages have different curves that fit them better, just as different exponents better fit a Zipf
distribution of different languages. For the languages studied, $1.6 < \mu < 2.1$ and $0.4 < \sigma < 0.6$.

This universality could be used to favor nativist explanations of human language [33, 34], where
language is claimed to be determined by innate constraints. However, the high-ranked diversity of
language tails could be used in favor of adaptationist explanations as well [35], as the precise rank of tail
words is highly contingent. In recent years, explanations of human language relating biological evolution
(genetically encoded innate properties) and learning (epigenetical adaptation) with culture have gained
strength [36-38]. Even so, few assumptions are necessary to explain some general aspects of the evolution
of human languages [39]. The present work shows that the evolution of word frequency can be explained
with Gaussian random walks, where the size of the change in word frequency is proportional to its rank,
i.e. frequent words change less than infrequent words. This explanation does not require innate properties,
adaptive advantages, nor culture. This does not imply that the latter are irrelevant for other aspects of
language evolution. Note that our study is carried out at a statistical level. We do not address syntactic,
semantic, and grammatical aspects of human language [40-43], which are certainly important.


Why does the rank diversity approach a lognormal distribution? Which processes and mechanisms are
required for this? There is one condition for a variable to have a lognormal distribution. This condition is
that the variable should be the result of a high number of different and independent causes which produce
positive effects composed multiplicatively. Thus, each cause has a negligible effect on the global result [44].
Our Gaussian random walk model supports this as a suitable explanation: the statistical distribution of
d is always lognormal, there is a high number of components (words), each word has a negligible effect
compared to the language properties, i.e. large changes in word frequency (ranking) do not cause large
changes in the statistical properties of each language, and the rank of each word is partially a cumulative
product of its rank in previous times, as expressed in equation (3). Languages statistically comply with
these dynamics, and that serves as an explanation for their evolution and structure.

In future work, it will be relevant to study the rank diversity of n-grams with n > 1 [45], other
linguistic corpora and phenomena with dynamic rank distributions [27, 46-48], and more generally
temporal networks [49-52]. A specific example would be the ranking of chess players, given by the World
Chess Federation (Fédération Internationale des Échecs). The rank diversity in this case is provided in
Fig. 7, which shows that the sigmoid is appropriate also for this case.



Figure 7. Rank diversity of male chess players obtained from the trimestral FIDE rankings from
April, 2001 to May, 2012 ($\Delta t = 50$), considering the first 10,000 ranks. Blue dots show rank diversity,
windowed in the red line. The black line shows the sigmoid fit with $\mu = 1.24$ and $\sigma = 0.76$. The green line
shows a simulation with $\sigma = 0.18$. Notice that there is no head as $\mu - 2\sigma < 0$. This is to be expected, as
many players enter and leave the ranking during the years considered.
Figure 8. Rank distributions of words according to frequency. (a): Normalized word frequency $f_r$ as a
function of the rank k for several languages for books published in the year 2000. Languages are
color-coded as in Fig. 1. (b): Word frequency $f_r$ as a function of the rank k for English and several years, normalized
so that the most frequent element has relative frequency one. In the inset, the unnormalized frequency f
is shown.


Supporting Information 


S1 Models for rank-frequency distributions


The rank-frequency distributions of words for different languages are very similar to each other, as shown
in Fig. 8a. The distributions are also similar across centuries, as shown in Fig. 8b.

We present five different distributions with distinct origins, though all of them contain the common
factor $1/k^a$. The distributions are:


TOi(fc) 

m 2 {k) 

m 3 (k) 

mi(k) 

m 5 (k) 


■A/i 


M 2 

A/3 
Ma 


e -b(k-i) 

k a ’ 

(N + 1 - k) a 

1 ? ’ 

(N+l-k^e-^-V 


Mi 


5 \ K 


k a 

k < k c 
k > k c 


(54) 

(55) 

(56) 

(57) 

(58) 


where $\mathcal{N}_i$ are normalization factors, depending on the parameters $a$, $b$, and $\alpha$ of the different models, and
N is the total number of words.

In Fig. 9 we compare the fit of these distributions with the observed curves. It can be seen that none of
the distributions closely reproduces the data set. We calculated the $\chi^2$ test for all fits, with similar results.
The best value corresponds to the fit proposed in [15], namely the double Zipf model (equation (S8)).
In all cases we studied the p-value of the data, needed for an appropriate interpretation of the goodness
of the fit. In all cases, that is for all years, all languages and all models, this number was smaller than
machine precision. This shows that none of these models satisfactorily captures the behavior of the data.

The origin of some of these models is similar. The following discussion shows how they can be 
encompassed in a common formulation. 
Figure 9. Comparison between the different models, equations (S4)-(S8), and the rank-frequency
distribution. We use the data for the year 2000 and all languages under consideration. The
logarithm base 10 of the ratio of the observed values and the model is plotted. It can be seen that
different models fit better in different regions. However, there is no model that fits all languages and all
regions much better than the others.


Given a set of words forming a text, one can evaluate the number of times N(k, t) that a certain word
appears with the rank k at time t. If B(k) and D(k) denote, respectively, the probability per unit time
that a word enters or leaves the rank k, we have:

\[
\frac{\partial N(k,t)}{\partial t} = \{ \zeta(k) - F(E(t)) \}\, N(k,t)
+ \{ D(k+1)\, N(k+1,t) + B(k-1)\, N(k-1,t) - [D(k) + B(k)]\, N(k,t) \}. \tag{S9}
\]

Here the two terms on the r.h.s. within the first curly brackets describe, respectively, the local growth
rate and the overall decrease rate acting on N(k, t). The total number of words at a given time t is E(t)
and F is a function that determines global constraint features that refer to the total number of words.
The terms within the second curly brackets indicate the balance arising from the birth B(k) and death
D(k) contributions of first-neighbor words with k ± 1 ranks at time t. If we consider the total number of
words at a given time t to be a fixed quantity,

\[
E(t) = \sum_k N(k,t), \tag{S10}
\]

we can define the probability density of finding a word with rank k, or relative frequency distribution, by

\[
n(k,t) = \frac{N(k,t)}{E(t)}. \tag{S11}
\]

Substitution of equation (S10) and equation (S11) into equation (S9) leads to

\[
\frac{d}{dt} E(t) = \left[ \langle \zeta(k) \rangle - F(E(t)) \right] E(t), \tag{S12}
\]
where the bracket indicates a sum over all k weighted by n(k, t). We assume, for simplicity, that $\zeta(k)$ is
a linear function of the number of edges k, so that $\langle \zeta(k) \rangle = \zeta_0 + \zeta_1 \langle k \rangle$, where $\zeta_0$ and $\zeta_1$ are constants.

Then equation (S9) reduces to the following master equation for a one step process:

\[
\dot{n}(k,t) = V(k)\, n(k,t)
+ \{ D(k+1)\, n(k+1,t) + B(k-1)\, n(k-1,t) - [D(k) + B(k)]\, n(k,t) \}, \tag{S13}
\]

where $\dot{n}(k,t) = \partial n(k,t)/\partial t$ and the effective potential $V(k) = \zeta(k) - \langle \zeta(k) \rangle$ has the property

\[
\langle V(k) \rangle = 0. \tag{S14}
\]


In what follows we shall only consider the case $\zeta(k) = \zeta_0$, so equation (S13) reduces to the general form of
the master equation for a one step process,

\[
\dot{n}(k,t) = D(k+1)\, n(k+1,t) + B(k-1)\, n(k-1,t) - [D(k) + B(k)]\, n(k,t). \tag{S15}
\]


If the changes in k are small and we are only interested in solutions n(k, t) that vary slowly with k, then
k may be treated as a continuous variable and we obtain the Fokker-Planck equation:

\[
\frac{\partial n(k,t)}{\partial t} = -\frac{\partial}{\partial k}\left[ g(k)\, n(k,t) \right] + \frac{1}{2} \frac{\partial^2}{\partial k^2}\left[ f(k)\, n(k,t) \right], \tag{S16}
\]


where f(k) = B(k) + D(k) and g(k) = B(k) − D(k). For the stationary solutions m(k), we have the
equation

\[
g(k)\, m(k) = \frac{1}{2} \frac{d}{dk}\left[ f(k)\, m(k) \right]. \tag{S17}
\]
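As an intermediate step that the text does not spell out, the stationary equation can be integrated directly. The following is a sketch, assuming f(k) > 0 on the range of interest; the integration constant C is a symbol introduced here for illustration.

```latex
\begin{align}
\frac{d}{dk} \ln\left[ f(k)\, m(k) \right] &= \frac{2\, g(k)}{f(k)}, \\
m(k) &= \frac{C}{f(k)} \exp\left( 2 \int^{k} \frac{g(k')}{f(k')} \, dk' \right).
\end{align}
```

Written this way, the stationary solution depends on the transition probabilities only through f(k) and the ratio g(k)/f(k): a constant term in the ratio contributes an exponential factor in k, and each simple pole contributes a power-law factor.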


If we approximate g(k)/f(k) by Pade approximants g n {k) / f n (k) = Aq + Y^k=i (k+c k ) > stationary 
solution becomes 


m{k) = A/"exp (AqA:) (fc + Cfe) 


— 


(S18) 


k =1 


If we assume the simplest expression for the D(k) and B(k) transition probabilities,

\[
D(k) = \lambda_1 (c_1 + k)(N_1 - k), \tag{S19}
\]

\[
B(k) = \lambda_2 (c_2 + k)(N_2 - k), \tag{S20}
\]

then

\[
m(k) = \mathcal{N} \exp(A_0 k)\, \frac{(N - k)^{b}}{(c + k)^{a}}, \tag{S21}
\]

where $N = \frac{1}{2}(N_1 + N_2)$, $c = \frac{1}{2}(c_1 + c_2) = 1$, $a = c_1 - c_2 + 1$, and $b = N_1 - N_2 - 1$. Also we must
remember that in our case k starts at one. Then if $A_0 = b = 0$, we have the Zipf model; when $A_0 \neq 0$ and
$b = 0$, the $\gamma$ model is obtained (equation (S5)); if $A_0 = 0$ but $b \neq 0$ the $\beta$ model is obtained (equation (S6));
finally, if $A_0$ and $b$ are different from 0, we have the general $\beta\gamma$ model (equation (S7)).

These and additional results could be obtained using the complex network language [53-55].

With respect to the distribution of equation (S8), the derivation given in [15] is based on the following
assumption: the existence of two word regimes, a language core containing words with low rank that do
not affect the birth of new words, and the remaining high-ranked words, which reduce the probability of
new words being used.
S2 Variation of words in time

Table S1 shows the most frequent words for the year 2000 with their translation and relative frequency.
Notice that these are very similar across languages. Table S2 shows the most frequent nouns for the 
years 1700, 1800, 1900, and 2000. There are similarities across languages and across centuries, but also 
important differences. 
Figs. 10-16 show rank trajectories of words for the languages studied, including our simulated language.
It can be seen that the behavior is similar for all languages: words with low rank (heads) almost do not 
vary in time. Afterwards the variation in rank depends on the rank itself, approximating a scale-invariant 
random walk. Notice that there is a higher variation at all scales before 1850. Further work is required to 
measure how much this variation depends on having less data before 1850 and how much on language 
properties of the time. 

Fig. 17 shows the distribution of relative flights for all languages. See main text for details.

Figure 17. Distribution of relative flights for all languages studied. A plot similar to the one
presented in Fig. 5 is shown for the other languages. The same color coding and details are used.


S3 Correlation of relative frequency changes 

We studied the correlations of the relative frequency changes (flights), defined in the main text as

\[
\Delta_t = \frac{k_{t+1} - k_t}{k_t}. \tag{S22}
\]

We shall use a normalized version of it:

\[
d_t = \frac{\Delta_t - \langle \Delta_t \rangle}{\sqrt{\langle (\Delta_t - \langle \Delta_t \rangle)^2 \rangle}}, \tag{S23}
\]

where $\langle \cdot \rangle$ denotes average over time. This normalization ensures that both $\langle d_t \rangle = 0$ and $\langle d_t^2 \rangle = 1$. The
time correlation is given by

\[
C_\tau = \langle d_t\, d_{t+\tau} \rangle. \tag{S24}
\]

In principle, this quantity also depends on t, but usually this dependence is very weak, as in this case,
and one can ignore it.
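The normalization and the time correlation defined above can be sketched for a single series of flights as follows. The function names are illustrative, and averaging the correlation over the available values of t is an assumption about how the time average is taken for a finite series.

```python
import math


def normalize(flights):
    """Normalized flights d_t: zero mean and unit mean square over time.

    Assumes the flights are not all equal (otherwise the width is zero).
    """
    mean = sum(flights) / len(flights)
    variance = sum((x - mean) ** 2 for x in flights) / len(flights)
    width = math.sqrt(variance)
    return [(x - mean) / width for x in flights]


def time_correlation(d, tau):
    """C_tau = <d_t d_{t+tau}>, averaged over the available values of t."""
    n = len(d) - tau
    return sum(d[t] * d[t + tau] for t in range(n)) / n


# Toy series that alternates sign at every step (hypothetical).
flights = [0.1, -0.1] * 4
d = normalize(flights)
c0 = time_correlation(d, 0)
c1 = time_correlation(d, 1)
print(c0, c1)  # close to 1 and -1
```

For tau = 0 the correlation equals 1 by construction. The perfectly alternating toy series gives a correlation of -1 at tau = 1, an extreme case of a negative value at lag one.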

In Fig. 18 we show the average of $C_\tau$ over 50 different ranks chosen randomly, for the different languages 
as well as for the simulated language. We note that the correlation is very small, except for $\tau = 0$, 
where it is 1 due to the normalization chosen, and for $\tau = 1$, where a negative value, typical of bounded 
sequences, is observed for the six languages studied here. The random Gaussian model reproduces 
these correlations well, except at $\tau = 1$. 
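One simple way to see why a negative value at lag one is typical of bounded sequences is the following limiting case. It is offered only as an illustration, not as the derivation behind the figure. Assume a quantity $x_t$ that fluctuates independently around a fixed level with variance $\sigma^2$, and consider its plain increments $\delta_t$ rather than the relative flights; the symbols $x_t$, $\delta_t$ and $\sigma$ are introduced only for this sketch.

```latex
\delta_t = x_{t+1} - x_t, \qquad
\langle \delta_t \delta_{t+1} \rangle
  = \langle (x_{t+1} - x_t)(x_{t+2} - x_{t+1}) \rangle = -\sigma^2, \qquad
\langle \delta_t^2 \rangle = 2\sigma^2
\;\Longrightarrow\;
C_1 = -\tfrac{1}{2}, \qquad C_\tau = 0 \quad (\tau \ge 2).
```

In this idealized case a step in one direction tends to be followed by a step back, so the lag-one correlation is negative while longer lags are uncorrelated.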
Table 2. Lowest ranked nouns for different years (top left cell) and different languages. Note that some 
words are used not only as nouns, which can give them a higher rank. For example, été in French is 
summer, but also the past participle of être (to be). 


| 1700 | English | German | French | Italian | Spanish | Russian |
|---|---|---|---|---|---|---|
| 1 | god | Erfahrung, experience | fait, fact | rei, king | fe, faith | день, day |
| 2 | man | Gottesfurcht, fear of god | dieu, god | sez, section | señor, mr. | города, city |
| 3 | men | Derselben, the same | point, point | civ, civil code | cardenal, cardinal | капитанъ, captain |
| 4 | people | Denselben, the same | corps, body | giudice, judge | rey, king | года, year |
| 5 | first | Dieselbe, the same | amour, love | parte, part | dios, god | утру, morning |
| 6 | things | Dieselben, the same | car, car | comma, paragraph | solo, single | полки, shelves |
| 7 | time | Denselben, the same | Reims, Reims | lavoro, work | tiempo, time | ночь, night |
| 8 | world | Menschen, people | temps, time | diritto, right | san, saint | лошадей, horses |
| 9 | thing | Alter, age | homme, man | art, article | duque, duke | городъ, city |
| 10 | power | Jugend, youth | roy, king | sentenza, judgment | ácido, acid | вечеру, evening |

| 1800 | English | German | French | Italian | Spanish | Russian |
|---|---|---|---|---|---|---|
| 1 | time | Nichts, nothing | fait, fact | era, era | dios, god | время, time |
| 2 | king | Zeit, time | point, point | parte, part | parte, part | году, year |
| 3 | man | Art, type | été, summer | tempo, time | tiempo, time | день, day |
| 4 | god | Derselben, the same | eau, water | prima, first | solo, single | года, year |
| 5 | first | Menschen, people | partie, part | stato, state | señor, mr. | времени, time |
| 6 | part | Allein, alone | corps, body | città, city | hombre, man | людей, people |
| 7 | men | Natur, nature | temps, time | repubblica, republic | cuerpo, body | города, city |
| 8 | general | — | terre, land | cose, things | vida, life | образомъ, way |
| 9 | people | — | nombre, number | fatto, fact | modo, mode | земли, land |
| 10 | place | — | homme, man | luogo, place | hombres, men | будетъ, will |

| 1900 | English | German | French | Italian | Spanish | Russian |
|---|---|---|---|---|---|---|
| 1 | time | Selbst, even | été, summer | era, era | señor, mr. | время, time |
| 2 | man | Jahre, years | fait, fact | parte, part | parte, part | года, year |
| 3 | first | Weise, wise | point, point | stato, state | ley, law | жизни, life |
| 4 | life | Ersten, first | temps, time | legge, law | gobierno, government | времени, time |
| 5 | men | Recht, right | cas, case | prima, first | estado, state | образомъ, way |
| 6 | day | Art, type | droit, right | fatto, fact | derecho, right | будетъ, will |
| 7 | old | Einzelnen, individual | loi, law | tempo, time | años, years | томъ, volume |
| 8 | years | Frage, question | partie, part | vita, life | año, year | году, year |
| 9 | work | Nichts, nothing | Paris, Paris | anni, age | ciudad, city | права, right |
| 10 | people | — | France, France | Italia, Italy | artículo, article | право, right |

| 2000 | English | German | French | Italian | Spanish | Russian |
|---|---|---|---|---|---|---|
| 1 | time | Deutschen, German | fait, fact | era, era | parte, part | время, time |
| 2 | first | Jahre, years | été, summer | parte, part | años, years old | том, volume |
| 3 | people | Menschen, people | paris, Paris | stato, state | estado, state | года, year |
| 4 | work | Frage, question | temps, time | prima, first | vida, life | федерации, federation |
| 5 | way | Deutschland, Germany | pays, country | anni, years | años, years | жизни, life |
| 6 | life | Jahren, years | politique, policy | vita, life | nacional, national | лет, years |
| 7 | world | Berlin, Berlin | vie, life | tempo, time | tiempo, time | человек, man |
| 8 | way | Ersten, first | france, France | secondo, second | social, social | году, year |
| 9 | state | Entwicklung, development | travail, work | modo, way | forma, form | раз, time |
| 10 | years | Arbeit, work | | | | человека, human |









Figure 10. Rank variations in time of twenty words from three different scales for English. 

Figure 11. Rank variations in time of twenty words from three different scales for German. 

Figure 12. Rank variations in time of twenty words from three different scales for French. 

Figure 13. Rank variations in time of twenty words from three different scales for Italian. 

Figure 14. Rank variations in time of twenty words from three different scales for Spanish. 

Figure 15. Rank variations in time of twenty words from three different scales for Russian. 

Figure 16. Rank variations in time of twenty words from three different scales for our simulated 
language. 

Figure 18. Correlations for relative frequency changes for different languages. Black line shows 
correlations for the simulated language. 







